Discipline background
Part 1: Urban systems science?
Part 2: Spatial data
Pedagogic challenges
Part 3: Data contamination / manipulation
Part 4: Big data
Part 5: Reproducibility
Part 6: Teaching criticality, data bias, reproducibility
Lecturer in Spatial Data Science and Visualization at CASA, UCL
Lead MSc modules in:
Research:
Big data for allocating funding
A set of towns and cities [or functions within cities] that can be considered linked together by various forms of social and economic interaction
Source: Oxford reference
Methods aimed at studying a system through its collective behavioral features
Source: Cristiano et al. 2020
The science of cities – using evidence to understand how cities work – is forever expanding
Source: UK Government
Urban science is an interdisciplinary field that studies diverse urban issues and problems
Source: Wikipedia
Urban Systems: Cities [or functions within cities] that can be considered linked together [there is a relationship between them]
+
Urban Science: Urban issues and problems
=
Smart Cities: networks and services are made more efficient with the use of digital solutions for the benefit of its inhabitants and business.
Source: Smart Cities, European Commission
The same as regular data science but with spatial data
Ran 4 scenarios:
Geographic Coordinate Reference System
Projected Coordinate Reference System
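A geographic CRS stores positions as angles (degrees of longitude and latitude), whereas a projected CRS stores them in planar units such as metres. The sketch below, in plain Python, illustrates why this matters for distance. The haversine formula and the mean Earth radius are assumptions chosen for illustration and are not taken from the slides.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (an approximation)

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def naive_degrees(lon1, lat1, lon2, lat2):
    """Treats degrees as if they were flat x/y units, which is the mistake to avoid."""
    return math.hypot(lon2 - lon1, lat2 - lat1)

# One degree of longitude at the equator and at 60 degrees north.
equator_m = haversine_m(0, 0, 1, 0)
north_m = haversine_m(0, 60, 1, 60)

# The naive "distance" is 1.0 in both cases, but the ground distance roughly halves.
print(naive_degrees(0, 0, 1, 0), naive_degrees(0, 60, 1, 60))
print(round(equator_m), round(north_m))
```

Because a degree of longitude shrinks towards the poles, measurements such as distance, area and buffers are normally made after transforming the data to a suitable projected CRS.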
Spatial data is just like normal data except it has an extra “geometry column”
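As a minimal sketch of the "extra geometry column" idea, the table below is an ordinary list of rows in which one column happens to hold a point in well-known text (WKT). The site names, visitor counts and coordinates are made up for illustration, and `parse_point` is a hypothetical helper that handles only simple `POINT (x y)` strings. In practice a spatial library such as sf or GeoPandas manages this column.

```python
# An ordinary table: attribute columns plus one "geometry" column (WKT text).
rows = [
    {"name": "Site A", "visitors": 120, "geometry": "POINT (-0.1340 51.5246)"},
    {"name": "Site B", "visitors": 80, "geometry": "POINT (-0.1276 51.5072)"},
    {"name": "Site C", "visitors": 300, "geometry": "POINT (-0.0877 51.5079)"},
]

def parse_point(wkt):
    """Hypothetical helper: extract (x, y) from a simple 'POINT (x y)' string."""
    inside = wkt[wkt.index("(") + 1 : wkt.index(")")]
    x, y = (float(v) for v in inside.split())
    return x, y

# Attribute queries work exactly as on non-spatial data.
busy = [r["name"] for r in rows if r["visitors"] > 100]

# Spatial questions use the geometry column, here a bounding box.
coords = [parse_point(r["geometry"]) for r in rows]
xs = [x for x, _ in coords]
ys = [y for _, y in coords]
bbox = (min(xs), min(ys), max(xs), max(ys))

print(busy)
print(bbox)
```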
Every 10 years US electoral districts are redrawn (“redistricting”). Thomas Hofeller (Republican) = PACK and CRACK
“Redistricting is democracy at work” - Tom Hofeller
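Packing concentrates the other side's voters into a few districts, and cracking spreads the remainder thinly across the rest. The toy example below uses invented numbers (50 voters, two parties, five districts of ten) and does not describe any real plan. It sketches how the same votes can produce opposite results depending on where the boundaries fall.

```python
def seats_won(districts):
    """Count districts won by party A and party B from (votes_a, votes_b) pairs."""
    a = sum(1 for votes_a, votes_b in districts if votes_a > votes_b)
    b = sum(1 for votes_a, votes_b in districts if votes_b > votes_a)
    return a, b

# Hypothetical electorate: 30 voters for party A, 20 for party B.
# Plan 1: every district mirrors the overall 60/40 split.
even_plan = [(6, 4)] * 5

# Plan 2: A's voters are PACKED into two districts and CRACKED across the other three.
packed_and_cracked = [(10, 0), (8, 2), (4, 6), (4, 6), (4, 6)]

for plan in (even_plan, packed_and_cracked):
    total_a = sum(a for a, _ in plan)
    total_b = sum(b for _, b in plan)
    print(total_a, total_b, seats_won(plan))
```

Both plans contain identical vote totals, yet party B wins no districts under the first plan and a majority under the second.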
Big geospatial data include datasets that are too large to be processed using traditional GIS tools
Source: GIS Harvard
Raster
Landsat satellite data: 400 scenes of Earth a day, revisiting each location every 16 days
Vector
New York City Taxi and Limousine Commission (TLC) all records from Yellow and Green Cabs
OpenStreetMap
We are moving from row-based storage to column-based storage
About 50x faster than a .csv
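The difference between row-based and column-based storage can be sketched in plain Python. The trip records are hypothetical, and the two structures only mimic the layouts; this is not how Parquet is actually implemented.

```python
# Row-based layout (like a .csv): each record keeps all of its fields together.
row_store = [
    {"trip_id": 1, "fare": 7.5, "tip": 1.0},
    {"trip_id": 2, "fare": 12.0, "tip": 2.5},
    {"trip_id": 3, "fare": 4.5, "tip": 0.0},
]

# Column-based layout: all values of one field sit next to each other.
column_store = {key: [row[key] for row in row_store] for key in row_store[0]}

# Mean fare from rows: every record (including trip_id and tip) is visited.
mean_fare_rows = sum(row["fare"] for row in row_store) / len(row_store)

# Mean fare from columns: only the "fare" column is touched.
fares = column_store["fare"]
mean_fare_cols = sum(fares) / len(fares)

print(column_store)
print(mean_fare_rows, mean_fare_cols)
```

Analytical queries usually need a few columns from many rows, which is why the columnar layout tends to suit them.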
It groups our data.
For example, a row group size of 2 puts all the data from rows 1 and 2 next to each other, then the next group starts at row 3 = GROUPS or PARTITIONS
If we have large data this means we can skip groups we don’t need
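Skipping groups can be sketched as follows. Each row group keeps the minimum and maximum of its values, and a query reads a group only when that range could contain a match. The fare values and the function names `make_row_groups` and `query_greater_than` are hypothetical; this is a simplified picture of the idea, not the Parquet format itself.

```python
def make_row_groups(values, size):
    """Split values into consecutive groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start : start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def query_greater_than(groups, threshold):
    """Return matching values and how many groups actually had to be read."""
    hits, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # the statistics prove nothing here can match, so skip the group
        groups_read += 1
        hits.extend(v for v in group["rows"] if v > threshold)
    return hits, groups_read

fares = [3, 5, 8, 9, 14, 20]          # hypothetical, already sorted
groups = make_row_groups(fares, 2)    # row group size of 2 gives three groups
hits, groups_read = query_greater_than(groups, 10)

print(hits, groups_read, len(groups))
```

Skipping works best when the data are sorted or partitioned on the column being filtered, because each group then covers a narrow range of values.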
In the R for Data Science book a 9 GB .csv
is queried in
Database management system
Columnar data
No installation
Convert our Parquet file to DuckDB and back again!
Regarding performance, parquet is 717 times faster than the same query on a csv file, and duckdb is 2808 times faster.
Source: Christophe Nicault
All (Parquet and DuckDB) make use of dplyr:
select(), filter(), group_by()
= direct integration with R
Currently the support for spatial data is very limited
sfarrow - can load and query the data but can’t do any analysis!
5 million random points
Despite all these tools we must start with the basics.
Often this is in QGIS (formerly Quantum GIS, free) or ArcMap ($)
We will be exploring QGIS in the workshop later
2017: 90% of the data in the world today has been created in the last two years alone, at 2.5 quintillion bytes of data a day! - IBM
All of the implementations were tested with the same input data.
They all gave the same results except the ESRI/ArcGIS implementation (Li 2018)
and although ESRI provide help for the GWR tools, the actual coding is closed—the underlying code is not revealed
Source: Brunsdon and Comber, 2021
Traditional labs were distributed as PDFs, Word documents and PowerPoints.
Used ArcGIS 💰
Learning happens by doing
Weekly homework that we dedicate time to discussing
You need to calculate the average percent of science students (in all grades) per county meeting the required standards
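A task like this is a group-and-summarise operation. The sketch below shows the shape of the calculation in plain Python. The county names, grades, percentages and the `pct_met` field name are all invented, and the real homework data will be structured differently.

```python
from collections import defaultdict

# Hypothetical records: one row per county and grade.
records = [
    {"county": "Adams", "grade": 5, "pct_met": 60.0},
    {"county": "Adams", "grade": 8, "pct_met": 70.0},
    {"county": "Benton", "grade": 5, "pct_met": 40.0},
    {"county": "Benton", "grade": 8, "pct_met": 50.0},
    {"county": "Benton", "grade": 10, "pct_met": None},  # suppressed or missing value
    {"county": "Benton", "grade": 11, "pct_met": 60.0},
]

def mean_pct_by_county(rows):
    """Unweighted mean of pct_met per county, ignoring missing values."""
    totals = defaultdict(lambda: [0.0, 0])  # county -> [running sum, count]
    for row in rows:
        if row["pct_met"] is None:
            continue
        totals[row["county"]][0] += row["pct_met"]
        totals[row["county"]][1] += 1
    return {county: total / n for county, (total, n) in totals.items()}

county_means = mean_pct_by_county(records)
print(county_means)
```

Note that this averages percentages without weighting by the number of students, and it silently drops missing values. Both are assumptions worth questioning when interpreting the result.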
Each practical answers a question…
What are the factors that might lead to variation in Average GCSE point scores across the city?
New York City wishes to conduct a study that aims to prevent people being evicted by understanding possible related factors. You have been enlisted as a consultant and tasked with conducting an analysis of their data from 2020.
Data:
How were the evictions recorded?
Why were there limited evictions during 2020, then a sudden peak? - COVID ban on evictions
How can identifying spatially related factors to evictions be useful?
Are there certain areas that have higher evictions than others - why might this be?
What assumptions does the data make?
What assumptions does the model make?
Students
Click the URL, which generates a new repository
Staff can see their work and when they make edits (commit / push)
It is essential to use data to inform decisions…BUT we must develop a critical awareness of:
In addition we must recognize that:
Scientists must have a say in the future of cities, McPhearson 2016
Pedagogic challenges 2: urban systems science • Andy MacLachlan